Time Series Analysis
This is a tutorial for Time Series Analysis with real-world data from my Ph.D., including:
- Time Plot, Seasonal Plot, Box Plot
- Seasonality, Trend
- Seasonal Subseries Plots
- Time Series Decomposition
- Detrend
- Stationarity
- Autocorrelation
- Moving Average, Expanding and Exponentially Weighted Moving Average
- Double and Triple Exponential Smoothing
- Simple forecasting methods
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller, kpss, acf, grangercausalitytests
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf, month_plot, quarter_plot
from scipy import signal
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style("whitegrid")
plt.rc('xtick', labelsize=15)
plt.rc('ytick', labelsize=15)
data = pd.read_csv('SoleaTimeSeriesDataset-Temperature-Salinity.csv')
data['year_month'] = pd.to_datetime(data['year_month'])
data['year'] = data['year_month'].dt.year
data['month'] = data['year_month'].dt.month
data.head()
 | id | obs_id | year_month | temperatureSurface | temperature100_300 | temperature300_400 | temperature100_500 | temperatureMaxDepth | salinitySurface | salinity100_300 | salinity300_400 | salinity100_500 | salinityMaxDepth | year | month
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 02008-01 | 0 | 2008-01-01 | 15.713940 | 13.886082 | 13.886082 | 13.088567 | 12.823628 | 37.147140 | 38.262170 | 38.262170 | 38.560493 | 38.584150 | 2008 | 1
1 | 02008-02 | 0 | 2008-02-01 | 14.792473 | 13.868667 | 13.868667 | 13.081740 | 12.822814 | 37.084457 | 38.264230 | 38.264230 | 38.563950 | 38.584694 | 2008 | 2
2 | 02008-03 | 0 | 2008-03-01 | 14.733397 | 13.963535 | 13.963535 | 13.082810 | 12.824340 | 36.935814 | 38.210846 | 38.210846 | 38.561314 | 38.586860 | 2008 | 3
3 | 02008-04 | 0 | 2008-04-01 | 15.850550 | 13.979778 | 13.979778 | 13.139173 | 12.829004 | 37.155983 | 38.196102 | 38.196102 | 38.553448 | 38.589336 | 2008 | 4
4 | 02008-05 | 0 | 2008-05-01 | 18.497694 | 13.637879 | 13.637879 | 13.128523 | 12.827576 | 37.341560 | 38.253180 | 38.253180 | 38.548040 | 38.588478 | 2008 | 5
Time Plot
For time series data, the obvious graph to start with is a time plot. That is, the observations are plotted against the time of observation, with consecutive observations joined by straight lines.
fig, ax = plt.subplots(figsize=(15, 6))
d = data[data['obs_id']==2]
sns.lineplot(x=d['year_month'], y=d['salinitySurface'], ax=ax)
ax.set_title('Salinity over Time', fontsize=20, loc='center', fontdict=dict(weight='bold'))
ax.set_xlabel('Year', fontsize=16, fontdict=dict(weight='bold'))
ax.set_ylabel('Salinity', fontsize=16, fontdict=dict(weight='bold'))
plt.tick_params(axis='y', which='major', labelsize=16)
plt.tick_params(axis='x', which='major', labelsize=16)
ax.yaxis.tick_left()  # where the y-axis marks will be
The monthly data show strong seasonality within each year. There is no cyclic behavior and no trend.
Seasonal Plot and Box Plots
A seasonal plot is similar to a time plot except that the data are plotted against the individual “seasons” in which the data were observed.
variable = 'salinitySurface'
fig, ax = plt.subplots(figsize=(15, 6))
palette = sns.color_palette("ch:2.5,-.2,dark=.3", 10)
sns.lineplot(x=d['month'], y=d[variable], hue=d['year'], palette=palette, ax=ax)
ax.set_title('Seasonal plot of Salinity', fontsize=20, loc='center', fontdict=dict(weight='bold'))
ax.set_xlabel('Month', fontsize=16, fontdict=dict(weight='bold'))
ax.set_ylabel('Salinity Surface', fontsize=16, fontdict=dict(weight='bold'))
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15, 6))
sns.boxplot(x=d['year'], y=d[variable], ax=ax[0])
ax[0].set_title('Year-wise Box Plot\n(The Trend)', fontsize=20, loc='center', fontdict=dict(weight='bold'))
ax[0].set_xlabel('Year', fontsize=16, fontdict=dict(weight='bold'))
ax[0].set_ylabel('Salinity Surface', fontsize=16, fontdict=dict(weight='bold'))
sns.boxplot(x=d['month'], y=d[variable], ax=ax[1])
ax[1].set_title('Month-wise Box Plot\n(The Seasonality)', fontsize=20, loc='center', fontdict=dict(weight='bold'))
ax[1].set_xlabel('Month', fontsize=16, fontdict=dict(weight='bold'))
ax[1].set_ylabel('Salinity Surface', fontsize=16, fontdict=dict(weight='bold'))
Seasonal Subseries Plots
A seasonal subseries plot collects the observations for each season into its own mini time plot, with a horizontal line marking the seasonal mean, which makes changes within and between seasons easy to spot.
y = data[data['obs_id']==2][['year_month','temperatureSurface','salinitySurface']]
y = y.set_index('year_month')
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(16, 6))
month_plot(y['temperatureSurface'], ylabel='TemperatureSurface', ax=ax[0]);
month_plot(y['salinitySurface'], ylabel='SalinitySurface', ax=ax[1]);
Decomposition
y = data[data['obs_id']==2][['year_month','temperatureSurface']]
y = y.set_index('year_month')
from pylab import rcParams
rcParams['figure.figsize'] = 15, 12
rcParams['axes.labelsize'] = 20
rcParams['ytick.labelsize'] = 16
rcParams['xtick.labelsize'] = 16
decomposition = sm.tsa.seasonal_decompose(y, model='additive')
decomp = decomposition.plot()
decomp.suptitle('Temperature Decomposition', fontsize=22)
The Hodrick-Prescott filter separates a time series $y_t$ into a trend component $\tau_t$ and a cyclical component $c_t$, so that $y_t = \tau_t + c_t$, by minimizing $\sum_t (y_t - \tau_t)^2 + \lambda \sum_t \left[(\tau_{t+1} - \tau_t) - (\tau_t - \tau_{t-1})\right]^2$. For monthly data the recommended smoothing parameter is $\lambda = 129{,}600$.
from statsmodels.tsa.filters.hp_filter import hpfilter
temp_cycle, temp_trend = hpfilter(y['temperatureSurface'], lamb=129600)
y['trend'] = temp_trend
y['cycle'] = temp_cycle
y[['trend','temperatureSurface','cycle']].plot(figsize=(15,6)).autoscale(axis='x',tight=True);
Measure the strength of the trend, $F_T = \max\left(0,\ 1 - \frac{\operatorname{Var}(R_t)}{\operatorname{Var}(T_t + R_t)}\right)$, which ranges from 0 (weak) to 1 (strong):
max(0,(1-decomposition.resid.var()/(decomposition.resid+decomposition.trend).var())[0])
0.10248782084731933
Measure the strength of seasonality, $F_S = \max\left(0,\ 1 - \frac{\operatorname{Var}(R_t)}{\operatorname{Var}(S_t + R_t)}\right)$:
max(0,(1-decomposition.resid.var()/(decomposition.resid+decomposition.seasonal).var())[0])
0.9880830435213952
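The two expressions above can be wrapped in a small helper for reuse. This is a sketch, not part of the original notebook; it assumes `decomposition` is the `seasonal_decompose` result from above (`.squeeze()` turns single-column DataFrame components into Series):

```python
def component_strength(decomposition):
    """Strength of trend and seasonality as defined above, each in [0, 1]."""
    resid = decomposition.resid.squeeze()
    trend = decomposition.trend.squeeze()
    seasonal = decomposition.seasonal.squeeze()
    # var() skips the NaN values that seasonal_decompose leaves at the series edges
    f_trend = max(0, 1 - resid.var() / (resid + trend).var())
    f_seasonal = max(0, 1 - resid.var() / (resid + seasonal).var())
    return f_trend, f_seasonal

component_strength(decomposition)  # -> (~0.102, ~0.988), matching the values above
```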
Detrend
detrended = signal.detrend(data.loc[data['obs_id']==2, 'temperatureSurface'].values)  # removes a least-squares linear trend
plt.plot(detrended)
Stationarity
from statsmodels.tsa.stattools import adfuller
# check for stationarity
def adf_test(series, title=''):
    """
    Pass in a time series and an optional title, returns an ADF report
    """
    print('Augmented Dickey-Fuller Test: {}'.format(title))
    result = adfuller(series.dropna(), autolag='AIC')  # .dropna() handles differenced data
    labels = ['ADF test statistic', 'p-value', '# lags used', '# observations']
    out = pd.Series(result[0:4], index=labels)
    for key, val in result[4].items():
        out['critical value ({})'.format(key)] = val
    print(out.to_string())  # .to_string() removes the line "dtype: float64"
    if result[1] <= 0.05:
        print("Strong evidence against the null hypothesis")
        print("Reject the null hypothesis")
        print("Data has no unit root and is stationary")
    else:
        print("Weak evidence against the null hypothesis")
        print("Fail to reject the null hypothesis")
        print("Data has a unit root and is non-stationary")
adf_test(data.loc[data['obs_id']==2, 'temperatureSurface'], title='')
Augmented Dickey-Fuller Test:
ADF test statistic -2.864219
p-value 0.049670
# lags used 10.000000
# observations 109.000000
critical value (5%) -2.888444
critical value (1%) -3.491818
critical value (10%) -2.581120
Strong evidence against the null hypothesis
Reject the null hypothesis
Data has no unit root and is stationary
# KPSS Test
result = kpss(data.loc[data['obs_id']==2, 'temperatureSurface'].values, regression='c')
print("\nKPSS Statistic: {}".format(result[0]))
print("P-Value: {}".format(result[1]))
print("Critical Values:")
for key, value in result[3].items():
    print("  {}: {}".format(key, value))
KPSS Statistic: 0.16268094415564427
P-Value: 0.1
Critical Values:
  5%: 0.463
  1%: 0.739
  2.5%: 0.574
  10%: 0.347
InterpolationWarning: p-value is greater than the indicated p-value (the actual p-value lies above 0.1, the largest value in the lookup table).
In the ADF test, the null hypothesis is that the time series possesses a unit root and is non-stationary. Because the p-value is below 0.05, we reject the null hypothesis.
The KPSS test, on the other hand, takes stationarity as its null hypothesis, so the interpretation of the p-value is the opposite of the ADF test: here the large p-value means we fail to reject stationarity.
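Because the hypotheses are reversed, it can help to wrap KPSS in a report function analogous to `adf_test` above. A minimal sketch (not in the original notebook), using the `kpss` function already imported:

```python
def kpss_test(series, title='', regression='c'):
    """
    Pass in a time series and an optional title, prints a KPSS report.
    Null hypothesis: the series is stationary (the reverse of ADF).
    """
    print('KPSS Test: {}'.format(title))
    statistic, p_value, n_lags, crit = kpss(series.dropna(), regression=regression)
    print('KPSS statistic: {}'.format(statistic))
    print('p-value: {}'.format(p_value))
    print('# lags used: {}'.format(n_lags))
    if p_value <= 0.05:
        print("Reject the null hypothesis: evidence of non-stationarity")
    else:
        print("Fail to reject the null hypothesis: consistent with stationarity")

kpss_test(data.loc[data['obs_id']==2, 'temperatureSurface'], title='temperatureSurface')
```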
Granger Causality Tests
Granger causality tests check whether past values of one series improve forecasts of another. Essentially we're looking for extremely low p-values.
grangercausalitytests(data[data['obs_id']==2][['temperatureSurface','salinitySurface']],maxlag=2);
Granger Causality
number of lags (no zero) 1
ssr based F test: F=58.2046 , p=0.0000 , df_denom=116, df_num=1
ssr based chi2 test: chi2=59.7099 , p=0.0000 , df=1
likelihood ratio test: chi2=48.3902 , p=0.0000 , df=1
parameter F test: F=58.2046 , p=0.0000 , df_denom=116, df_num=1
Granger Causality
number of lags (no zero) 2
ssr based F test: F=1.5797 , p=0.2106 , df_denom=113, df_num=2
ssr based chi2 test: chi2=3.2992 , p=0.1921 , df=2
likelihood ratio test: chi2=3.2539 , p=0.1965 , df=2
parameter F test: F=1.5797 , p=0.2106 , df_denom=113, df_num=2
At lag 1 the p-values are essentially zero, so one time series is useful in forecasting the other; at lag 2 the evidence disappears (p ≈ 0.2).
Autocorrelation
# Lag plots
from pandas.plotting import lag_plot
lag_plot(data[data['obs_id']==2]['temperatureSurface']);
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15, 6))
autocorr = acf(d[variable], nlags=12)        # just the numbers
plot_acf(d[variable], lags=36, ax=ax[0]);    # autocorrelation plot
plot_pacf(d[variable], lags=36, ax=ax[1]);   # partial autocorrelation plot
autocorr
array([1. , 0.9326662 , 0.85668922, 0.79436762, 0.75302265,
0.73118683, 0.72281034, 0.72513455, 0.74072217, 0.77355728,
0.82080037, 0.86408744, 0.88016387])
Simple Moving Average, Expanding and Exponentially Weighted Moving Average
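For reference, these are the standard definitions (not spelled out in the original notebook): a simple moving average replaces each value by the mean of the last $k$ observations, an expanding mean averages everything observed so far, and simple exponential smoothing weights recent observations more heavily via the recursion

$$\hat{y}_{t+1} = \alpha y_t + (1 - \alpha)\,\hat{y}_t, \qquad \alpha = \frac{2}{\mathrm{span} + 1},$$

so that weights decay geometrically with age; `span = 3` below gives $\alpha = 0.5$.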
y = data[data['obs_id']==2][['year_month','salinitySurface']]
y = y.set_index('year_month')
y['SMA3'] = y['salinitySurface'].rolling(window=3).mean()
y.plot(figsize=(15,6));
y1 = data[data['obs_id']==2][['year_month','salinitySurface']]
y1 = y1.set_index('year_month')
y1['salinitySurface'].expanding().mean().plot(figsize=(15,6));
from statsmodels.tsa.holtwinters import SimpleExpSmoothing
span = 3
alpha = 2/(span+1)
y.index.freq = 'MS'
y['SES3']=SimpleExpSmoothing(y['salinitySurface']).fit(smoothing_level=alpha,optimized=False).fittedvalues
y.head()
year_month | salinitySurface | SMA3 | SES3
---|---|---|---
2008-01-01 | 38.059185 | NaN | 38.059185
2008-02-01 | 37.981125 | NaN | 38.059185
2008-03-01 | 38.007427 | 38.015912 | 38.020155
2008-04-01 | 37.915913 | 37.968155 | 38.013791
2008-05-01 | 37.992910 | 37.972083 | 37.964852
y.plot(figsize=(15,6));
Double Exponential Smoothing
Where Simple Exponential Smoothing employs just one smoothing factor $\alpha$ (alpha), Double Exponential Smoothing adds a second smoothing factor $\beta$ (beta) that addresses trends in the data. Like the alpha factor, values for the beta factor fall between zero and one ($0 < \beta \leq 1$). The benefit here is that the model can anticipate future increases or decreases where the level model would only work from recent calculations.
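Written out, this is Holt's linear method in its standard textbook form (the formulation behind `ExponentialSmoothing(trend='add')`, which estimates both factors when `.fit()` is called without arguments): a level $\ell_t$ and a trend $b_t$ are updated at each step,

$$\ell_t = \alpha y_t + (1-\alpha)(\ell_{t-1} + b_{t-1})$$
$$b_t = \beta^* (\ell_t - \ell_{t-1}) + (1-\beta^*)\, b_{t-1}$$
$$\hat{y}_{t+h} = \ell_t + h\, b_t$$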
from statsmodels.tsa.holtwinters import ExponentialSmoothing
y['DESadd3'] = ExponentialSmoothing(y['salinitySurface'], trend='add').fit().fittedvalues.shift(-1)
y[['salinitySurface', 'SES3', 'DESadd3']].iloc[:12].plot(figsize=(15,6));
Triple Exponential Smoothing
Triple Exponential Smoothing, the method most closely associated with Holt-Winters, adds support for both trends and seasonality in the data.
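In the additive Holt-Winters formulation (standard equations, with seasonal period $m = 12$ for this monthly data), a seasonal term $s_t$ joins the level and trend:

$$\ell_t = \alpha (y_t - s_{t-m}) + (1-\alpha)(\ell_{t-1} + b_{t-1})$$
$$b_t = \beta^* (\ell_t - \ell_{t-1}) + (1-\beta^*)\, b_{t-1}$$
$$s_t = \gamma (y_t - \ell_{t-1} - b_{t-1}) + (1-\gamma)\, s_{t-m}$$
$$\hat{y}_{t+h} = \ell_t + h\, b_t + s_{t+h-m} \quad (h \leq m)$$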
y['TESadd3'] = ExponentialSmoothing(y['salinitySurface'],trend='add',seasonal='add',seasonal_periods=12).fit().fittedvalues
y[['salinitySurface', 'TESadd3', 'DESadd3']].plot(figsize=(15,6));
Simple forecasting methods
Average method
Here, the forecasts of all future values are equal to the mean of the historical data.
Naïve method
For naïve forecasts, we simply set all forecasts to be the value of the last observation.
Seasonal naïve method
In this case, we set each forecast to be equal to the last observed value from the same season of the year.
Drift method
A variation on the naïve method is to allow the forecasts to increase or decrease over time, where the amount of change over time (called the drift) is set to be the average change seen in the historical data.
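A minimal sketch of all four methods applied to the salinity series from above (not part of the original notebook; it assumes `y` with its monthly `'MS'` index is still in scope, and that the series covers whole years so the seasonal naïve indexing lines up):

```python
series = y['salinitySurface']
h = 12   # forecast horizon in months
m = 12   # seasonal period for the seasonal naive method
future = pd.date_range(series.index[-1], periods=h + 1, freq='MS')[1:]

avg_fc = pd.Series(series.mean(), index=future)       # average: mean of history
naive_fc = pd.Series(series.iloc[-1], index=future)   # naive: last observation
# seasonal naive: last observed value from the same month of the year
snaive_fc = pd.Series(series.iloc[-m:].values[np.arange(h) % m], index=future)
# drift: last observation plus the average historical change per step
drift = (series.iloc[-1] - series.iloc[0]) / (len(series) - 1)
drift_fc = pd.Series(series.iloc[-1] + drift * np.arange(1, h + 1), index=future)

ax = series.plot(figsize=(15, 6), label='observed')
for name, fc in [('average', avg_fc), ('naive', naive_fc),
                 ('seasonal naive', snaive_fc), ('drift', drift_fc)]:
    fc.plot(ax=ax, label=name)
ax.legend();
```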